RNNs have many variations with distinct structures. Fig. 9.10 shows one of the most common architectures: the left subplot is the schematic of a unit, the middle one shows the flow of data through the unit in more detail, and the right subplot illustrates how the unit unfolds over time.
Figure 9.10: Structure and workflow of RNN
The symbols in Fig. 9.10 have the following meanings: $x^{t}$ is the input at step $t$ in the time series; $h^{t}$ is the state of the hidden layer at $t$, which is determined by both $x^{t}$ and $h^{t-1}$ according to the above network architecture; $o^{t}$ is the output at $t$, which is determined by the hidden state at the current step $t$; $L^{t}$ is the loss function at $t$; $y^{t}$ is the actual output (or label) at $t$; $\tilde{y}^{t}$ is the predicted output at $t$; $\sigma_{1}$ and $\sigma_{2}$ are the hidden layer activation function and the output activation function, respectively; $U$, $W$, and $V$ are the arrays that contain the network weights to be learned.
These weights, i.e., $U$, $W$, and $V$, are shared through time. That is, though there are "multiple units" as the RNN is unfolded over time, there is essentially one unit. Thus, the backpropagation of the RNN updates the weights of the same unit.
The above symbols were given without discussing the shape of the input $x$, which is handled as one sample here. In many cases, $x$ has multiple attributes and thus needs to be formulated as a 1D array. In the following deductions for backpropagation, we will represent $x$ as a column array with $J$ elements (for the $J$ attributes). In this way, the derived equations can be easily used for coding. The actual output $\vec{y}$ and the predicted output $\tilde{\vec{y}}$ are 1D arrays with a shape of $K \times 1$. In implementation, we will still need to determine the shape of $\vec{h}$. Let us assume $\vec{h}$ has a shape of $N \times 1$. The shapes of the weight arrays can then be determined from those of $\vec{x}$, $\vec{y}$, and $\vec{h}$. The data flow, especially the change of the shapes of the arrays for data and weights, can be tracked in the following deductions.
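To make these shapes concrete, below is a minimal NumPy sketch of the arrays. The sizes and variable names are assumptions for illustration only; the shapes of $\bar{U}$, $\bar{W}$, and $\bar{V}$ follow from requiring the matrix products in the forward pass below to be dimensionally consistent.

```python
import numpy as np

# Assumed sizes: J input attributes, N hidden units, K outputs.
J, N, K = 3, 5, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(N, J))   # input-to-hidden weights,  shape (N, J)
W = rng.normal(scale=0.1, size=(N, N))   # hidden-to-hidden weights, shape (N, N)
V = rng.normal(scale=0.1, size=(K, N))   # hidden-to-output weights, shape (K, N)
b = np.zeros((N, 1))                     # hidden-layer bias, shape (N, 1)
c = np.zeros((K, 1))                     # output-layer bias, shape (K, 1)

x_t = rng.normal(size=(J, 1))            # one input sample at step t, shape (J, 1)
h_prev = np.zeros((N, 1))                # hidden state at step t-1,   shape (N, 1)
```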
9.3.1. Forward Pass
Let us first work on the forward pass. The following three operations will turn the input $\vec{x}_{J \times 1}$ into $\tilde{\vec{y}}$.
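In the standard formulation consistent with Fig. 9.10 (with $\vec{b}$ and $\vec{c}$ assumed to be the bias vectors of the hidden and output layers, and the subscripts tracking array shapes), these operations are

$$\vec{h}^{t}=\sigma_{1}\left(\bar{U}_{N \times J}\, \vec{x}^{t}+\bar{W}_{N \times N}\, \vec{h}^{t-1}+\vec{b}_{N \times 1}\right),$$
$$\vec{o}^{t}=\bar{V}_{K \times N}\, \vec{h}^{t}+\vec{c}_{K \times 1},$$
$$\tilde{\vec{y}}^{t}=\sigma_{2}\left(\vec{o}^{t}\right),$$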
where $\sigma_{1}$ and $\sigma_{2}$ are the two activation functions used for the hidden layer and output layer, respectively.
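As a small illustration, one forward step could be coded as follows. The function names `sigmoid` and `forward_step` are hypothetical, and tanh/Sigmoid are the activation choices adopted in the derivation below.

```python
import numpy as np

def sigmoid(z):
    """Element-wise Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_step(x_t, h_prev, U, W, V, b, c):
    """One forward step: input x_t and previous state h_prev -> (h_t, y_tilde_t)."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)   # hidden state, shape (N, 1)
    o_t = V @ h_t + c                         # output before activation, shape (K, 1)
    y_tilde_t = sigmoid(o_t)                  # predicted output, shape (K, 1)
    return h_t, y_tilde_t
```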
The loss function at time step $t$, $\ell^{t}$, measures the distance between $\tilde{\vec{y}}^{t}$ and $\vec{y}^{t}$. For example, the log likelihood function is usually used for classification tasks, and the following mean squared error function is used for regression tasks.
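For the regression case, a common explicit form, assumed here with a factor of $\frac{1}{2}$ so that its derivative is simply $\tilde{\vec{y}}^{t}-\vec{y}^{t}$, is

$$\ell^{t}=\frac{1}{2}\left\|\tilde{\vec{y}}^{t}-\vec{y}^{t}\right\|^{2}=\frac{1}{2} \sum_{k=1}^{K}\left(\tilde{y}_{k}^{t}-y_{k}^{t}\right)^{2}, \qquad \ell=\sum_{t=1}^{\tau} \ell^{t},$$

where the total loss $\ell$ is assumed to be the sum of the per-step losses and $\tau$ denotes the last time step.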
The RNN's backward pass, in which the network weights are updated via methods such as gradient descent, is termed Backpropagation Through Time (BPTT). The deduction of the backpropagation rules, i.e., the equations for propagating the gradients backward from the difference between the predicted and actual outputs, depends on the loss function and the activation functions.
In the following, we will derive the general backpropagation equations. In this deduction, we will use the mean squared error for the loss, tanh for $\sigma_{1}$, and Sigmoid for $\sigma_{2}$ to show more detailed equations that can be used for implementations. Accordingly, the three functions have the following derivatives, which are needed in the backpropagation.
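With these choices (and the $\frac{1}{2}$-scaled squared error assumed above), the needed derivatives can be written, for a generic scalar argument $z$, as

$$\frac{\partial \ell^{t}}{\partial \tilde{\vec{y}}^{t}}=\tilde{\vec{y}}^{t}-\vec{y}^{t}, \qquad \frac{d \tanh (z)}{d z}=1-\tanh ^{2}(z), \qquad \frac{d \operatorname{Sigmoid}(z)}{d z}=\operatorname{Sigmoid}(z)\left[1-\operatorname{Sigmoid}(z)\right].$$

The gradients of the output-layer parameters $\bar{V}$ and $\vec{c}$ then follow directly from the chain rule; for example, summing over the time steps,

$$\frac{\partial \ell}{\partial \bar{V}}=\sum_{t=1}^{\tau} \frac{\partial \ell^{t}}{\partial \vec{o}^{t}}\left(\vec{h}^{t}\right)^{T}, \qquad \frac{\partial \ell}{\partial \vec{c}}=\sum_{t=1}^{\tau} \frac{\partial \ell^{t}}{\partial \vec{o}^{t}}, \qquad \text{with} \quad \frac{\partial \ell^{t}}{\partial \vec{o}^{t}}=\left(\tilde{\vec{y}}^{t}-\vec{y}^{t}\right) \odot \tilde{\vec{y}}^{t} \odot\left(1-\tilde{\vec{y}}^{t}\right),$$

where $\odot$ denotes element-wise multiplication.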
The calculation of the gradients for $\bar{W}$, $\bar{U}$, and $\vec{b}$ is more complicated. As illustrated in the unfolded RNN, the gradient at a time step $t$ comes from two parts: the loss caused by $\tilde{\vec{y}}^{t}-\vec{y}^{t}$ at the current step and that from the next time step via the hidden layer connection (between $\vec{h}^{t}$ and $\vec{h}^{t+1}$). Thus, the gradient calculation needs to start from the last time step $\tau$ and then move backward in time step by step. $\bar{W}$, $\bar{U}$, and $\vec{b}$ are contained in the function for $\vec{h}^{t}$. Thus, the calculation of the gradients for these parameters will also involve $\frac{\partial \ell}{\partial \vec{h}^{t}}$ when applying the chain rule. For example, for $\bar{W}$, we have $\left(\frac{\partial \ell}{\partial \bar{W}}\right)_{N \times N}=\left(\frac{\partial \ell}{\partial \vec{h}}\right)_{N \times 1} \cdot\left(\frac{\partial \vec{h}}{\partial \bar{W}}\right)_{1 \times N}$. To simplify the deduction, we can define the following intermediate variable:
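A common form, consistent with the tanh hidden activation assumed above and with the output-side gradient $\frac{\partial \ell^{t}}{\partial \vec{o}^{t}}$ given earlier, is

$$\vec{\delta}^{t} \equiv \frac{\partial \ell}{\partial \vec{h}^{t}}=\bar{V}^{T} \frac{\partial \ell^{t}}{\partial \vec{o}^{t}}+\bar{W}^{T} \operatorname{diag}\left(1-\vec{h}^{t+1} \odot \vec{h}^{t+1}\right) \vec{\delta}^{t+1},$$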
where the function $\operatorname{diag}(\vec{h})$ converts a vector $\vec{h}$ with a shape of $(N,)$ or $(N, 1)$ into a 2D array with a shape of $(N, N)$ whose diagonal is $\vec{h}$. The recursion involves $\vec{\delta}^{t+1}$, which leads to a special case when $t=\tau$: at this last step, there is no gradient coming from the next step, so $\vec{\delta}^{\tau}$ can be calculated directly.
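Under the same assumptions, this last-step value contains only the output-side term:

$$\vec{\delta}^{\tau}=\bar{V}^{T} \frac{\partial \ell^{\tau}}{\partial \vec{o}^{\tau}}=\bar{V}^{T}\left[\left(\tilde{\vec{y}}^{\tau}-\vec{y}^{\tau}\right) \odot \tilde{\vec{y}}^{\tau} \odot\left(1-\tilde{\vec{y}}^{\tau}\right)\right].$$

The gradients of the hidden-layer parameters then follow by summing the per-step contributions, for example,

$$\frac{\partial \ell}{\partial \bar{W}}=\sum_{t=1}^{\tau} \operatorname{diag}\left(1-\vec{h}^{t} \odot \vec{h}^{t}\right) \vec{\delta}^{t}\left(\vec{h}^{t-1}\right)^{T}, \quad \frac{\partial \ell}{\partial \bar{U}}=\sum_{t=1}^{\tau} \operatorname{diag}\left(1-\vec{h}^{t} \odot \vec{h}^{t}\right) \vec{\delta}^{t}\left(\vec{x}^{t}\right)^{T}, \quad \frac{\partial \ell}{\partial \vec{b}}=\sum_{t=1}^{\tau} \operatorname{diag}\left(1-\vec{h}^{t} \odot \vec{h}^{t}\right) \vec{\delta}^{t}.$$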
When implementing the above backward pass, we can start from the last time step. First, we calculate $\frac{\partial \ell}{\partial \vec{c}}$, $\frac{\partial \ell}{\partial \bar{V}}$, and $\vec{\delta}^{t}$. Then, we can compute $\frac{\partial \ell}{\partial \bar{W}}$, $\frac{\partial \ell}{\partial \bar{U}}$, and $\frac{\partial \ell}{\partial \vec{b}}$.
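As a minimal sketch of this ordering (under the tanh, Sigmoid, and $\frac{1}{2}$-scaled squared error assumed above; the function name `bptt` and the NumPy formulation are only for illustration), the gradients can be accumulated backward over the stored states as follows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt(xs, ys, U, W, V, b, c, h0):
    """Forward pass over a sequence, then backpropagation through time.

    xs, ys : lists of (J, 1) inputs and (K, 1) labels for t = 1..tau
    Returns the total loss and the gradients dU, dW, dV, db, dc.
    """
    # ---- forward pass: store hidden states and predictions for reuse ----
    hs, y_tildes = [h0], []
    for x_t in xs:
        h_t = np.tanh(U @ x_t + W @ hs[-1] + b)
        y_tildes.append(sigmoid(V @ h_t + c))
        hs.append(h_t)

    loss = 0.5 * sum(np.sum((yt - y) ** 2) for yt, y in zip(y_tildes, ys))

    # ---- backward pass: start from the last step and move backward ----
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    delta_next = np.zeros_like(b)   # no gradient flows in from beyond t = tau
    h_next = np.zeros_like(b)

    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        # output-side gradient at step t (Sigmoid + 1/2-scaled squared error)
        d_o = (y_tildes[t] - ys[t]) * y_tildes[t] * (1.0 - y_tildes[t])
        dV += d_o @ h_t.T
        dc += d_o
        # delta^t: direct term plus the term coming from step t+1
        delta = V.T @ d_o + W.T @ ((1.0 - h_next ** 2) * delta_next)
        # gradients of the hidden-layer parameters at step t
        d_pre = (1.0 - h_t ** 2) * delta   # gradient w.r.t. the tanh pre-activation
        dU += d_pre @ xs[t].T
        dW += d_pre @ h_prev.T
        db += d_pre
        delta_next, h_next = delta, h_t

    return loss, dU, dW, dV, db, dc
```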
All the model parameters, i.e., $\vec{c}$, $\bar{V}$, $\bar{W}$, $\bar{U}$, and $\vec{b}$, which can be represented using a general parameter array, i.e., $\theta$, can be updated using gradient descent:
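Written out, with $\eta$ assumed to denote the learning rate, the update takes the usual form

$$\theta \leftarrow \theta-\eta \frac{\partial \ell}{\partial \theta}.$$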